[PATCH] Fix two out-of-bounds read issues when handling truncated UTF-8 input (#1005)
Two independent out-of-bounds read issues were identified in OpenCC's UTF-8
processing logic when handling malformed or truncated UTF-8 sequences.
1) MaxMatchSegmentation:
NextCharLength() could return a value larger than the remaining input size.
The previous logic subtracted this value from a size_t length counter, which
could wrap around to a very large value (unsigned underflow) and lead to
subsequent out-of-bounds reads.
2) Conversion:
Similar length handling could allow reads past the end of the input buffer
during dictionary matching, potentially propagating unintended bytes to the
conversion output.
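As an illustration of the underflow described in 1), here is a minimal
sketch. It is not the OpenCC code: DeclaredLength is a hypothetical stand-in
for a helper that, like the behavior described for NextCharLength(), trusts
the lead byte without checking how many bytes remain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not OpenCC's): sequence length implied by the lead
// byte alone, with no check against the remaining input size.
std::size_t DeclaredLength(unsigned char lead) {
  if (lead < 0x80) return 1;            // ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                             // invalid lead byte: step one byte
}

// Performs the unchecked subtraction on a size_t counter and returns the
// result. Unsigned arithmetic wraps, so a declared length larger than the
// remaining size yields a huge "remaining" value instead of zero.
std::size_t RemainingAfterUncheckedStep(const char* text) {
  assert(text != nullptr);
  std::size_t remaining = std::strlen(text);
  const std::size_t charLen =
      DeclaredLength(static_cast<unsigned char>(text[0]));
  return remaining - charLen;  // wraps when charLen > remaining
}
```

For a one-byte input holding only a 3-byte lead byte, the counter becomes
1 - 3, which wraps to the largest size_t value minus one, so a loop that runs
"while remaining > 0" would keep reading far past the buffer.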
This patch fixes both issues by:
- Explicitly tracking the end of the input buffer
- Recomputing remaining length on each iteration
- Clamping matched character and key lengths to the remaining buffer size
- Preventing reads past the null terminator
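The approach in the list above can be sketched roughly as follows. This is an
assumed, simplified loop, not the upstream change itself: LeadByteLength and
SplitClamped are hypothetical names, and the real code also clamps dictionary
key lengths during matching.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not OpenCC's): sequence length implied by the lead
// byte alone.
std::size_t LeadByteLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1;  // invalid lead byte: step one byte
}

// Splits a null-terminated string into per-character chunks without ever
// reading at or beyond the terminator.
std::vector<std::string> SplitClamped(const char* text) {
  assert(text != nullptr);
  const char* const end = text + std::strlen(text);  // explicit end pointer
  std::vector<std::string> chunks;
  for (const char* p = text; p < end;) {
    // Remaining length is recomputed from the pointers on every iteration,
    // so there is no running counter that could underflow.
    const std::size_t remaining = static_cast<std::size_t>(end - p);
    // Clamp the declared length to what is actually left in the buffer.
    const std::size_t charLen =
        std::min(LeadByteLength(static_cast<unsigned char>(*p)), remaining);
    chunks.emplace_back(p, charLen);
    p += charLen;
  }
  return chunks;
}
```

With this shape, a truncated trailing sequence is emitted as a shorter chunk
and the loop stops exactly at the end of the input.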
The changes preserve existing behavior for valid UTF-8 input and add test
coverage for truncated UTF-8 sequences.
These issues may have security implications when processing untrusted input
and are classified as heap out-of-bounds reads (CWE-125).
Co-authored-by: Claude <noreply@anthropic.com>
Applied-Upstream: https://github.com/BYVoid/OpenCC/commit/345c9a50ab07018f1b4439776bad78a0d40778ec
Gbp-Pq: Topic backport
Gbp-Pq: Name 345c9a50ab07018f1b4439776bad78a0d40778ec.patch